Learning Objectives:


Manipulating and Visualizing Data Frames

Last week you started to manipulate data tables (objects of class "data.frame") using bracket notation, dat[ , ], and the dollar operator, dat$name, in order to select specific rows, columns, or cells. In addition, you have been creating charts with functions like plot(), boxplot(), and barplot(), which are part of the "graphics" package.

In this lab, you will start learning about other approaches to manipulate tables and create statistical charts. We are going to use the functionality of the package "dplyr" to work with tabular data in a more consistent way. This is a fairly recent package introduced a couple of years ago, but it is based on more than a decade of research and work led by Hadley Wickham.

Likewise, to create graphics in a more consistent and visually pleasing way, we are going to use the package "ggplot2", also originally authored by Hadley Wickham, and developed as part of his PhD more than a decade ago.

Use the first hour of the lab to get as far as possible with the material associated with "dplyr". Then use the second hour of the lab to work on graphics with "ggplot2".

While you follow this lab, you may want to open these cheat sheets:


Filestructure and Shell Commands

We want you to keep practicing with the command line (e.g. Mac Terminal, Git Bash). Follow the steps listed below to create the necessary subdirectories like those depicted in this scheme:

    lab05/
      README.md
      data/
        nba2017-players.csv
      report/
        lab05.Rmd
        lab05.html
      images/
        ... # all the plot files

One possible terminal session:

    $ mkdir lab05
    $ cd lab05
    $ mkdir data report images
    $ ls
    data    images    report
    $ touch README.md
    $ cd data
    $ curl -O https://raw.githubusercontent.com/ucb-stat133/stat133-spring-2018/master/data/nba2017-players.csv
    $ wc nba2017-players.csv
         442    1632   39752 nba2017-players.csv
    $ head -1 nba2017-players.csv
    "player","team","position","height","weight","age","experience","college","salary","games","minutes","points","points3","points2","points1"
    $ tail -5 nba2017-players.csv
    "Marquese Chriss","PHO","PF",82,233,19,0,"University of Washington",2941440,82,1743,753,72,212,113
    "Ronnie Price","PHO","PG",74,190,33,11,"Utah Valley State College",282595,14,134,14,3,1,3
    "T.J. Warren","PHO","SF",80,230,23,2,"North Carolina State University",2128920,66,2048,951,26,377,119
    "Tyler Ulis","PHO","PG",70,150,21,0,"University of Kentucky",918369,61,1123,444,21,163,55
    "Tyson Chandler","PHO","C",85,240,34,15,"",12415000,47,1298,397,0,153,91

### Installing packages
I’m assuming that you already installed the packages "dplyr" and "ggplot2". If that’s not the case then run on the console the command below (do NOT include this command in your Rmd):
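
A minimal sketch of that installation step, assuming the three packages loaded later in this lab ("dplyr", "ggplot2", "readr") are the ones you need. The helper name install_if_missing() is hypothetical (it is not part of base R or of the lab); it only wraps the real functions installed.packages() and install.packages() so that nothing gets reinstalled unnecessarily.

```r
# Hypothetical helper (an assumption for illustration, not part of the lab):
# install only the packages that are not yet available on this machine.
install_if_missing <- function(pkgs) {
  # installed.packages() returns a matrix whose row names are package names
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) {
    install.packages(missing)
  }
  # return (invisibly) the names of the packages that had to be installed
  invisible(missing)
}

# Run once on the console, NOT inside your Rmd file:
# install_if_missing(c("dplyr", "ggplot2", "readr"))
```

Calling install.packages(c("dplyr", "ggplot2")) directly works just as well; the wrapper simply skips packages that are already there.
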
Remember that you only need to install a package once! After a package has been installed in your machine, there is no need to call install.packages() again on the same package. What you should always invoke in order to use the functions in a package is the library() function:

    # (include these commands in your Rmd file)
    # don't forget to load the packages
    library(dplyr)
    library(ggplot2)
    library(readr)

About loading packages: Another rule to keep in mind is to always load any required packages at the very top of your script files (.R or .Rmd or .Rnw files). Avoid calling the library() function in the middle of a script. Instead, load all the packages before anything else.
### Path for Images

    knitr::opts_chunk$set(echo = TRUE, fig.path = '../images/')

If you don’t specify fig.path, "knitr" will create a default directory to store all the plots produced when knitting an Rmd file. This time, however, we want to have more control over where things are placed. Because you already have a folder images/ as part of the filestructure, this is where we want "knitr" to save all the generated graphics.
Notice the use of a relative path fig.path = '../images/'. This is because your Rmd file should be inside the folder report/, but the folder images/ is outside report/ (i.e. in the same parent directory as report/).

NBA Players Data

dat <- read.csv('../data/nba2017-players.csv', stringsAsFactors = FALSE )

nba_data <- '../data/nba2017-players.csv'

The data file for this lab is the same one you used last week: nba2017-players.csv.

To import the data in R you can use the base function read.csv(), or you can also use read_csv() from the package "readr":

# with "base" read.csv()
dat <- read.csv(nba_data, stringsAsFactors = FALSE)

# with "readr" read_csv()
dat <- read_csv(nba_data)
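
The two import functions return slightly different objects, which is why printed output can look different depending on how the data was read. Here is a small sketch that uses a made-up two-row CSV written to a temporary file (the file and its contents are assumptions for illustration, not the lab data):

```r
library(readr)

# write a tiny, made-up CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c('"player","height"', '"A",80', '"B",75'), tmp)

# same file, two import functions
base_df  <- read.csv(tmp, stringsAsFactors = FALSE)
readr_df <- read_csv(tmp)

class(base_df)                 # a plain "data.frame"
inherits(readr_df, "tbl_df")   # TRUE: read_csv() returns a tibble
is.data.frame(readr_df)        # TRUE: a tibble is still a data frame
```

Either object can be passed to the "dplyr" functions used in the rest of this lab.
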

Basic "dplyr" verbs

To make the learning process of "dplyr" gentler, Hadley Wickham proposes beginning with a set of five basic verbs or operations for data frames (each verb corresponds to a function in "dplyr"):
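
As a quick sketch, the five verbs commonly listed in the introductory "dplyr" documentation (filter(), select(), arrange(), mutate(), and summarise()) can be tried on a toy data frame. The data below is made up for illustration and is not the NBA data set.

```r
library(dplyr)

# toy data frame (made-up values, for illustration only)
toy <- data.frame(
  player = c("A", "B", "C", "D"),
  team   = c("X", "X", "Y", "Y"),
  height = c(80, 75, 83, 78),
  stringsAsFactors = FALSE
)

filter(toy, height > 76)                    # keep rows that satisfy a condition
select(toy, player, height)                 # keep columns by name
arrange(toy, height)                        # reorder rows
mutate(toy, height_m = height * 0.0254)     # add a new column
summarise(toy, avg_height = mean(height))   # collapse rows into a summary value
```
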

I’ve slightly modified Hadley’s list of verbs:


Filtering, slicing, and selecting

slice() allows you to select rows by position:

# first three rows
three_rows <- slice(dat, 1:3)
three_rows
## # A tibble: 3 x 15
##          player  team position height weight   age experience
##           <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1    Al Horford   BOS        C     82    245    30          9
## 2  Amir Johnson   BOS       PF     81    240    29         11
## 3 Avery Bradley   BOS       SG     74    180    26          6
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
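
The positions passed to slice() do not have to be a consecutive range: any vector of integers should work, including one generated with seq(). The sketch below uses a made-up data frame with 20 rows rather than the NBA data.

```r
library(dplyr)

# toy data frame with 20 rows (made-up values)
toy <- data.frame(id = 1:20, value = (1:20)^2)

slice(toy, c(2, 4, 6))           # rows 2, 4 and 6
slice(toy, seq(5, 20, by = 5))   # every fifth row: 5, 10, 15, 20
slice(toy, -(1:15))              # negative positions drop rows (keeps rows 16 to 20)
```
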

filter() allows you to select rows by condition:

# subset rows given a condition
# (height greater than 85 inches)
gt_85 <- filter(dat, height > 85)
gt_85
##               player team position height weight age experience
## 1        Edy Tavares  CLE        C     87    260  24          1
## 2   Boban Marjanovic  DET        C     87    290  28          1
## 3 Kristaps Porzingis  NYK       PF     87    240  21          1
## 4        Roy Hibbert  DEN        C     86    270  30          8
## 5      Alexis Ajinca  NOP        C     86    248  28          6
##                 college  salary games minutes points points3 points2
## 1                          5145     1      24      6       0       3
## 2                       7000000    35     293    191       0      72
## 3                       4317720    66    2164   1196     112     331
## 4 Georgetown University 5000000     6      11      4       0       2
## 5                       4600000    39     584    207       0      89
##   points1
## 1       0
## 2      47
## 3     198
## 4       0
## 5      29
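
filter() also accepts more than one condition. A short sketch on a made-up data frame (not the NBA data) showing the usual ways to combine conditions:

```r
library(dplyr)

# toy data frame (made-up values, for illustration only)
toy <- data.frame(
  player   = c("A", "B", "C", "D", "E"),
  team     = c("X", "X", "Y", "Y", "Z"),
  position = c("C", "PG", "C", "SF", "PG"),
  height   = c(84, 75, 86, 79, 73),
  stringsAsFactors = FALSE
)

filter(toy, team == "X", position == "C")    # conditions separated by commas are combined with AND
filter(toy, team == "X" & position == "C")   # same result, written with &
filter(toy, height > 85 | height < 74)       # OR: either condition may hold
filter(toy, team %in% c("X", "Z"))           # membership in a set of values
```
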

select() allows you to select columns by name:

# columns by name
player_height <- select(dat, player, height)
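
To connect these verbs with the bracket notation from last week, here is a rough side-by-side sketch on a made-up data frame. The pairs below select the same rows and columns in this example.

```r
library(dplyr)

# toy data frame (made-up values, for illustration only)
toy <- data.frame(
  player = c("A", "B", "C"),
  height = c(79, 75, 83),
  weight = c(230, 190, 270),
  stringsAsFactors = FALSE
)

# rows by position
slice(toy, 1:2)
toy[1:2, ]

# rows by condition
filter(toy, height > 78)
toy[toy$height > 78, ]

# columns by name
select(toy, player, height)
toy[ , c("player", "height")]
```

One difference to keep in mind: when the condition evaluates to NA for some row, filter() drops that row, whereas logical subsetting with brackets returns a row of NA values.
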

Your turn:

  • use slice() to subset the data by selecting the first 5 rows.
slice(dat,1:5)
## # A tibble: 5 x 15
##              player  team position height weight   age experience
##               <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1        Al Horford   BOS        C     82    245    30          9
## 2      Amir Johnson   BOS       PF     81    240    29         11
## 3     Avery Bradley   BOS       SG     74    180    26          6
## 4 Demetrius Jackson   BOS       PG     73    201    22          0
## 5      Gerald Green   BOS       SF     79    205    31          9
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
  • use slice() to subset the data by selecting rows 10, 15, 20, …, 50.
# seq(10, 50, by = 5) generates 10, 15, 20, ..., 50 (9 positions),
# whereas 10:50 would select all 41 rows from 10 to 50
slice(dat, seq(10, 50, by = 5))
  • use slice() to subset the data by selecting the last 5 rows.
# nrow() gives the number of rows; length() of a data frame is its number of columns
n <- nrow(dat)
slice(dat, (n-4):n)
## # A tibble: 5 x 15
##            player  team position height weight   age experience
##             <chr> <chr>    <chr>  <int>  <int> <int>      <int>
## 1 Marquese Chriss   PHO       PF     82    233    19          0
## 2    Ronnie Price   PHO       PG     74    190    33         11
## 3     T.J. Warren   PHO       SF     80    230    23          2
## 4      Tyler Ulis   PHO       PG     70    150    21          0
## 5  Tyson Chandler   PHO        C     85    240    34         15
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## #   minutes <int>, points <int>, points3 <int>, points2 <int>,
## #   points1 <int>
  • use filter() to subset those players with height less than 70 inches tall.
filter(dat, height < 70)
##          player team position height weight age experience
## 1 Isaiah Thomas  BOS       PG     69    185  27          5
## 2    Kay Felder  CLE       PG     69    176  21          0
##                    college  salary games minutes points points3 points2
## 1 University of Washington 6587132    76    2569   2199     245     437
## 2       Oakland University  543471    42     386    166       7      55
##   points1
## 1     590
## 2      35
  • use filter() to subset rows of Golden State Warriors (‘GSW’).
GSW <- filter(dat, team=="GSW")
GSW
##                  player team position height weight age experience
## 1        Andre Iguodala  GSW       SF     78    215  33         12
## 2          Damian Jones  GSW        C     84    245  21          0
## 3            David West  GSW        C     81    250  36         13
## 4        Draymond Green  GSW       PF     79    230  26          4
## 5             Ian Clark  GSW       SG     75    175  25          3
## 6  James Michael McAdoo  GSW       PF     81    230  24          2
## 7          JaVale McGee  GSW        C     84    270  29          8
## 8          Kevin Durant  GSW       SF     81    240  28          9
## 9          Kevon Looney  GSW        C     81    220  20          1
## 10        Klay Thompson  GSW       SG     79    215  26          5
## 11          Matt Barnes  GSW       SF     79    226  36         13
## 12        Patrick McCaw  GSW       SG     79    185  21          0
## 13     Shaun Livingston  GSW       PG     79    192  31         11
## 14        Stephen Curry  GSW       PG     75    190  28          7
## 15        Zaza Pachulia  GSW        C     83    270  32         13
##                                  college   salary games minutes points
## 1                  University of Arizona 11131368    76    1998    574
## 2                  Vanderbilt University  1171560    10      85     19
## 3                      Xavier University  1551659    68     854    316
## 4              Michigan State University 15330435    76    2471    776
## 5                     Belmont University  1015696    77    1137    527
## 6           University of North Carolina   980431    52     457    147
## 7             University of Nevada, Reno  1403611    77     739    472
## 8          University of Texas at Austin 26540100    62    2070   1555
## 9  University of California, Los Angeles  1182840    53     447    135
## 10           Washington State University 16663575    78    2649   1742
## 11 University of California, Los Angeles   383351    20     410    114
## 12       University of Nevada, Las Vegas   543471    71    1074    282
## 13                                        5782450    76    1345    389
## 14                      Davidson College 12112359    79    2638   1999
## 15                                        2898000    70    1268    426
##    points3 points2 points1
## 1       64     155      72
## 2        0       8       3
## 3        3     132      43
## 4       81     191     151
## 5       61     150      44
## 6        2      60      21
## 7        0     208      56
## 8      117     434     336
## 9        2      54      21
## 10     268     376     186
## 11      18      20      20
## 12      41      65      29
## 13       1     172      42
## 14     324     351     325
## 15       0     164      98
  • use filter() to subset rows of GSW centers (‘C’).
filter(GSW, position=="C")
##          player team position height weight age experience
## 1  Damian Jones  GSW        C     84    245  21          0
## 2    David West  GSW        C     81    250  36         13
## 3  JaVale McGee  GSW        C     84    270  29          8
## 4  Kevon Looney  GSW        C     81    220  20          1
## 5 Zaza Pachulia  GSW        C     83    270  32         13
##                                 college  salary games minutes points
## 1                 Vanderbilt University 1171560    10      85     19
## 2                     Xavier University 1551659    68     854    316
## 3            University of Nevada, Reno 1403611    77     739    472
## 4 University of California, Los Angeles 1182840    53     447    135
## 5                                       2898000    70    1268    426
##   points3 points2 points1
## 1       0       8       3
## 2       3     132      43
## 3       0     208      56
## 4       2      54      21
## 5       0     164      98
  • use filter() and then select(), to subset rows of Lakers (‘LAL’), and then display their names.
LAL <- filter(dat, team =="LAL")
select(LAL, player)
##               player
## 1     Brandon Ingram
## 2       Corey Brewer
## 3   D'Angelo Russell
## 4        David Nwaba
## 5        Ivica Zubac
## 6    Jordan Clarkson
## 7      Julius Randle
## 8    Larry Nance Jr.
## 9          Luol Deng
## 10 Metta World Peace
## 11        Nick Young
## 12       Tarik Black
## 13   Thomas Robinson
## 14    Timofey Mozgov
## 15       Tyler Ennis
  • use filter() and then select(), to display the name and salary, of GSW point guards
GSW_point <- filter(GSW, position =="PG")
select(GSW_point, player, salary)
##             player   salary
## 1 Shaun Livingston  5782450
## 2    Stephen Curry 12112359
  • find how to select the name, age, and team, of players with more than 10 years of experience, making 10 million dollars or less.
experience10 <- filter(dat, experience > 10)
# "10 million dollars or less" includes exactly 10 million, hence <=
salary10m <- filter(experience10, salary <= 10000000)
A <- select(salary10m, player, age, team)
A
##               player age team
## 1      Dahntay Jones  36  CLE
## 2     Deron Williams  32  CLE
## 3        James Jones  36  CLE
## 4        Kyle Korver  35  CLE
## 5  Richard Jefferson  36  CLE
## 6      Jose Calderon  35  ATL
## 7     Kris Humphries  31  ATL
## 8      Mike Dunleavy  36  ATL
## 9        Jason Terry  39  MIL
## 10        C.J. Miles  29  IND
## 11     Udonis Haslem  36  MIA
## 12        Beno Udrih  34  DET
## 13        David West  36  GSW
## 14       Matt Barnes  36  GSW
## 15  Shaun Livingston  31  GSW
## 16     Zaza Pachulia  32  GSW
## 17         David Lee  33  SAS
## 18      Lou Williams  30  HOU
## 19      Trevor Ariza  31  HOU
## 20      Brandon Bass  31  LAC
## 21       Paul Pierce  39  LAC
## 22    Raymond Felton  32  LAC
## 23        Boris Diaw  34  UTA
## 24     Nick Collison  36  OKC
## 25        Tony Allen  35  MEM
## 26      Vince Carter  40  MEM
## 27     Jameer Nelson  34  DEN
## 28       Mike Miller  36  DEN
## 29      Devin Harris  33  DAL
## 30 Metta World Peace  37  LAL
## 31   Leandro Barbosa  34  PHO
## 32      Ronnie Price  33  PHO
  • find how to select the name, team, height, and weight, of rookie players, 20 years old, displaying only the first five occurrences (i.e. rows)
# rookies are the players with 0 years of experience
rookies20 <- filter(dat, experience == 0, age == 20)
rookies20 <- select(rookies20, player, team, height, weight)
slice(rookies20, 1:5)
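
Several of the exercises above apply filter(), then select(), then slice(), storing an intermediate object at each step. As an alternative sketch, the same steps can be chained with the pipe operator %>% that becomes available when "dplyr" is loaded. The data frame below is made up for illustration.

```r
library(dplyr)

# toy data frame (made-up values, for illustration only)
toy <- data.frame(
  player     = c("A", "B", "C", "D", "E"),
  team       = c("X", "X", "Y", "X", "Y"),
  age        = c(33, 24, 31, 36, 20),
  experience = c(12, 3, 11, 15, 0),
  stringsAsFactors = FALSE
)

# step by step, with intermediate objects
veterans <- filter(toy, team == "X", experience > 10)
veterans <- select(veterans, player, age)
step_by_step <- slice(veterans, 1:2)

# the same operations chained with the pipe:
# the result of each step becomes the first argument of the next one
piped <- toy %>%
  filter(team == "X", experience > 10) %>%
  select(player, age) %>%
  slice(1:2)

piped
```

Both versions should give the same rows; the piped form reads from top to bottom and avoids naming the intermediate results.
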

Adding new variables: mutate()

Another basic verb is mutate() which allows you to add new variables. Let’s create a small data frame for the Warriors with three columns: player, height, and weight:

# creating a small data frame step by step
gsw <- filter(dat, team == 'GSW')
gsw <- select(gsw, player, height, weight)
gsw <- slice(gsw, c(4, 8, 10, 14, 15))
gsw
## # A tibble: 5 x 3
##           player height weight
##            <chr>  <int>  <int>
## 1 Draymond Green     79    230
## 2   Kevin Durant     81    240
## 3  Klay Thompson     79    215
## 4  Stephen Curry     75    190
## 5  Zaza Pachulia     83    270

Now, let’s use mutate() to (temporarily) add a column with the ratio height / weight:

mutate(gsw, height / weight)
## # A tibble: 5 x 4
##           player height weight `height/weight`
##            <chr>  <int>  <int>           <dbl>
## 1 Draymond Green     79    230       0.3434783
## 2   Kevin Durant     81    240       0.3375000
## 3  Klay Thompson     79    215       0.3674419
## 4  Stephen Curry     75    190       0.3947368
## 5  Zaza Pachulia     83    270       0.3074074

You can also give a new name, like: ht_wt = height / weight:

mutate(gsw, ht_wt = height / weight)
## # A tibble: 5 x 4
##           player height weight     ht_wt
##            <chr>  <int>  <int>     <dbl>
## 1 Draymond Green     79    230 0.3434783
## 2   Kevin Durant     81    240 0.3375000
## 3  Klay Thompson     79    215 0.3674419
## 4  Stephen Curry     75    190 0.3947368
## 5  Zaza Pachulia     83    270 0.3074074

In order to permanently change the data, you need to assign the changes to an object:

gsw2 <- mutate(gsw, ht_m = height * 0.0254, wt_kg = weight * 0.4536)
gsw2
## # A tibble: 5 x 5
##           player height weight   ht_m   wt_kg
##            <chr>  <int>  <int>  <dbl>   <dbl>
## 1 Draymond Green     79    230 2.0066 104.328
## 2   Kevin Durant     81    240 2.0574 108.864
## 3  Klay Thompson     79    215 2.0066  97.524
## 4  Stephen Curry     75    190 1.9050  86.184
## 5  Zaza Pachulia     83    270 2.1082 122.472
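
A small sketch of the difference between the "temporary" and the "permanent" use of mutate(), on a made-up data frame. It also shows that a column created inside mutate() can be reused later in the same call; the column name bmi is just an illustrative choice.

```r
library(dplyr)

# toy data frame (made-up values, for illustration only)
toy <- data.frame(
  player = c("A", "B"),
  height = c(79, 81),
  weight = c(230, 240),
  stringsAsFactors = FALSE
)

# the result is printed, but toy itself is not modified
mutate(toy, ht_m = height * 0.0254)
names(toy)    # still "player" "height" "weight"

# assigning the result keeps the new columns;
# bmi reuses ht_m and wt_kg, which are created earlier in the same call
toy2 <- mutate(toy,
               ht_m  = height * 0.0254,
               wt_kg = weight * 0.4536,
               bmi   = wt_kg / ht_m^2)
names(toy2)
```
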

Reordering rows: arrange()

The next basic verb of "dplyr" is arrange() which allows you to reorder rows. For example, here’s how to arrange the rows of gsw by height:

# order rows by height (increasingly)
arrange(gsw, height)
## # A tibble: 5 x 3
##           player height weight
##            <chr>  <int>  <int>
## 1  Stephen Curry     75    190
## 2 Draymond Green     79    230
## 3  Klay Thompson     79    215
## 4   Kevin Durant     81    240
## 5  Zaza Pachulia     83    270

By default arrange() sorts rows in increasing order. To arrange rows in descending order you need to use the auxiliary function desc().

# order rows by height (decreasingly)
arrange(gsw, desc(height))
## # A tibble: 5 x 3
##           player height weight
##            <chr>  <int>  <int>
## 1  Zaza Pachulia     83    270
## 2   Kevin Durant     81    240
## 3 Draymond Green     79    230
## 4  Klay Thompson     79    215
## 5  Stephen Curry     75    190
# order rows by height, and then weight
arrange(gsw, height, weight)
## # A tibble: 5 x 3
##           player height weight
##            <chr>  <int>  <int>
## 1  Stephen Curry     75    190
## 2  Klay Thompson     79    215
## 3 Draymond Green     79    230
## 4   Kevin Durant     81    240
## 5  Zaza Pachulia     83    270
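
Sorting keys can be mixed: desc() applies only to the variable it wraps, and later variables break ties among earlier ones. Combined with head(), this gives a simple "top k" pattern. The sketch below uses a made-up data frame.

```r
library(dplyr)

# toy data frame (made-up values, for illustration only)
toy <- data.frame(
  player = c("A", "B", "C", "D"),
  height = c(79, 79, 75, 83),
  weight = c(230, 215, 190, 270),
  stringsAsFactors = FALSE
)

# tallest first; among players of equal height, the lighter one comes first
arrange(toy, desc(height), weight)

# the two heaviest players: sort in descending order, then keep the first rows
head(arrange(toy, desc(weight)), 2)
```
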

Your Turn

  • using the data frame gsw, add a new variable product with the product of height and weight.
mutate(gsw, product = height*weight)
## # A tibble: 5 x 4
##           player height weight product
##            <chr>  <int>  <int>   <int>
## 1 Draymond Green     79    230   18170
## 2   Kevin Durant     81    240   19440
## 3  Klay Thompson     79    215   16985
## 4  Stephen Curry     75    190   14250
## 5  Zaza Pachulia     83    270   22410
  • create a new data frame gsw3, by adding columns log_height and log_weight with the log transformations of height and weight.
gsw3 <- mutate(gsw, log_height = log(height), log_weight = log(weight))
gsw3
## # A tibble: 5 x 5
##           player height weight log_height log_weight
##            <chr>  <int>  <int>      <dbl>      <dbl>
## 1 Draymond Green     79    230   4.369448   5.438079
## 2   Kevin Durant     81    240   4.394449   5.480639
## 3  Klay Thompson     79    215   4.369448   5.370638
## 4  Stephen Curry     75    190   4.317488   5.247024
## 5  Zaza Pachulia     83    270   4.418841   5.598422
  • use the original data frame to filter() and arrange() those players with height less than 71 inches tall, in increasing order.
newheight <- filter(dat, height < 71)
arrange(newheight, height)
##          player team position height weight age experience
## 1 Isaiah Thomas  BOS       PG     69    185  27          5
## 2    Kay Felder  CLE       PG     69    176  21          0
## 3    Tyler Ulis  PHO       PG     70    150  21          0
##                    college  salary games minutes points points3 points2
## 1 University of Washington 6587132    76    2569   2199     245     437
## 2       Oakland University  543471    42     386    166       7      55
## 3   University of Kentucky  918369    61    1123    444      21     163
##   points1
## 1     590
## 2      35
## 3      55
  • display the name, team, and salary, of the top-5 highest paid players
salaries <- select(dat, player, team, salary)
salaries <- arrange(salaries, desc(salary))
head(salaries, 5)
##          player team   salary
## 1  LeBron James  CLE 30963450
## 2    Al Horford  BOS 26540100
## 3 DeMar DeRozan  TOR 26540100
## 4  Kevin Durant  GSW 26540100
## 5  James Harden  HOU 26540100
  • display the name, team, and points3, of the top 10 three-point players
D <- select(dat, player,team, points3)
E <- arrange(D, desc(points3))
head(E,10)
##            player team points3
## 1   Stephen Curry  GSW     324
## 2   Klay Thompson  GSW     268
## 3    James Harden  HOU     262
## 4     Eric Gordon  HOU     246
## 5   Isaiah Thomas  BOS     245
## 6    Kemba Walker  CHO     240
## 7    Bradley Beal  WAS     223
## 8  Damian Lillard  POR     214
## 9   Ryan Anderson  HOU     204
## 10    J.J. Redick  LAC     201
  • create a data frame gsw_mpg of GSW players, that contains variables for player name, experience, and min_per_game (minutes per game), sorted by min_per_game (in descending order)
# keep only the Warriors before computing minutes per game
gsw_mpg <- filter(dat, team == 'GSW')
gsw_mpg <- mutate(gsw_mpg, min_per_game = minutes / games)
gsw_mpg <- select(gsw_mpg, player, experience, min_per_game)
gsw_mpg <- arrange(gsw_mpg, desc(min_per_game))
gsw_mpg
##                  player experience min_per_game
## 1         Klay Thompson          5    33.961538
## 2         Stephen Curry          7    33.392405
## 3          Kevin Durant          9    33.387097
## 4        Draymond Green          4    32.513158
## 5        Andre Iguodala         12    26.289474
## 6           Matt Barnes         13    20.500000
## 7         Zaza Pachulia         13    18.114286
## 8      Shaun Livingston         11    17.697368
## 9         Patrick McCaw          0    15.126761
## 10            Ian Clark          3    14.766234
## 11           David West         13    12.558824
## 12         JaVale McGee          8     9.597403
## 13 James Michael McAdoo          2     8.788462
## 14         Damian Jones          0     8.500000
## 15         Kevon Looney          1     8.433962
## 179              Joe Johnson         15    23.628205
## 180              Tim Frazier          2    23.461538
## 181              James Ennis          2    23.453125
## 182            Mason Plumlee          3    23.407407
## 183               C.J. Miles         11    23.368421
## 184          Jordan Crawford          4    23.263158
## 185            Troy Williams          0    23.166667
## 186         Gerald Henderson          7    23.152778
## 187         Bojan Bogdanovic          2    23.115385
## 188          Larry Nance Jr.          1    22.888889
## 189         Anthony Tolliver          8    22.723077
## 190            Jahlil Okafor          1    22.680000
## 191        Spencer Dinwiddie          2    22.610169
## 192  Rondae Hollis-Jefferson          1    22.576923
## 193         Isaiah Whitehead          0    22.506849
## 194              Greg Monroe          6    22.506173
## 195             Tyreke Evans          7    22.428571
## 196         Luc Mbah a Moute          8    22.337500
## 197         Sergio Rodriguez          4    22.323529
## 198               Jeff Green          8    22.231884
## 199          Bismack Biyombo          5    22.135802
## 200              Joakim Noah          9    22.065217
## 201               J.J. Barea         10    22.028571
## 202         Lance Stephenson          6    22.000000
## 203             Nerlens Noel          2    21.954545
## 204              Patty Mills          7    21.925000
## 205             Brandon Rush          8    21.914894
## 206             Shelvin Mack          5    21.909091
## 207               Joe Harris          2    21.884615
## 208             Caris LeVert          0    21.701754
## 209          Justin Anderson          1    21.583333
## 210             Jamal Murray          0    21.512195
## 211              Enes Kanter          5    21.291667
## 212             Jared Dudley          9    21.281250
## 213          Marquese Chriss          0    21.256098
## 214           Raymond Felton         11    21.250000
## 215           Kenneth Faried          5    21.245902
## 216               Taj Gibson          7    21.173913
## 217           Brandon Knight          5    21.111111
## 218             Lance Thomas          5    21.043478
## 219           Richaun Holmes          1    20.929825
## 220             Kelly Olynyk          3    20.506667
## 221              Jodie Meeks          7    20.500000
## 222              Matt Barnes         13    20.500000
## 223             Axel Toupane          1    20.500000
## 224          Andrew Harrison          0    20.472222
## 225           Timofey Mozgov          6    20.444444
## 226        Richard Jefferson         15    20.430380
## 227      Dorian Finney-Smith          0    20.271605
## 228                 Alex Len          3    20.259740
## 229           Deron Williams         11    20.250000
## 230         Domantas Sabonis          0    20.148148
## 231             Amir Johnson         11    20.100000
## 232           Justin Holiday          3    19.987805
## 233             Kosta Koufos          8    19.985915
## 234         Chandler Parsons          5    19.852941
## 235              David Nwaba          0    19.850000
## 236        Langston Galloway          2    19.736842
## 237            D.J. Augustin          8    19.717949
## 238           Doug McDermott          2    19.545455
## 239         Shabazz Muhammad          3    19.435897
## 240              John Henson          4    19.362069
## 241             Ben McLemore          3    19.278689
## 242              Paul Zipser          0    19.159091
## 243             Jerami Grant          2    19.102564
## 244           Lucas Nogueira          2    19.087719
## 245      Willie Cauley-Stein          1    18.946667
## 246            Channing Frye         10    18.891892
## 247  Michael Carter-Williams          3    18.800000
## 248            Manu Ginobili         14    18.710145
## 249                David Lee         11    18.696203
## 250               Randy Foye         10    18.608696
## 251               Dante Exum          1    18.606061
## 252          Skal Labissiere          0    18.545455
## 253              Jason Terry         17    18.445946
## 254              Jeremy Lamb          4    18.435484
## 255               Sam Dekker          1    18.428571
## 256               Tyler Ulis          0    18.409836
## 257          Justin Hamilton          2    18.390625
## 258        Willy Hernangomez          0    18.388889
## 259         Montrezl Harrell          1    18.344828
## 260          Nemanja Bjelica          1    18.307692
## 261            Zaza Pachulia         13    18.114286
## 262            Norman Powell          1    18.000000
## 263              Ian Mahinmi          8    17.903226
## 264         Jonathon Simmons          1    17.846154
## 265              Tyler Ennis          2    17.818182
## 266          Stanley Johnson          1    17.805195
## 267         Shaun Livingston         11    17.697368
## 268             Mike Muscala          3    17.671429
## 269             Troy Daniels          3    17.656716
## 270               Boris Diaw         13    17.575342
## 271           Dewayne Dedmon          3    17.500000
## 272           Josh McRoberts          9    17.318182
## 273            Dwight Powell          2    17.311688
## 274  Timothe Luwawu-Cabarrot          0    17.246377
## 275             Jaylen Brown          0    17.192308
## 276             Wayne Selden          0    17.181818
## 277                 Ed Davis          6    17.152174
## 278         Denzel Valentine          0    17.122807
## 279          Malcolm Delaney          0    17.095890
## 280              Noah Vonleh          2    17.094595
## 281                Kris Dunn          0    17.089744
## 282         Derrick Williams          5    17.080000
## 283              Omri Casspi          7    17.076923
## 284             Terry Rozier          1    17.067568
## 285            Derrick Jones          0    17.031250
## 286             Devin Harris         12    16.723077
## 287          Michael Beasley          8    16.696429
## 288             Delon Wright          1    16.518519
## 289           Meyers Leonard          4    16.513514
## 290                Ron Baker          0    16.480769
## 291              C.J. Watson          9    16.322581
## 292             Jerian Grant          1    16.317460
## 293               Trey Lyles          1    16.309859
## 294              Tarik Black          2    16.283582
## 295         Brandon Jennings          7    16.260870
## 296           Ramon Sessions          9    16.220000
## 297          Mirza Teletovic          4    16.185714
## 298     Georgios Papagiannis          0    16.136364
## 299              Ivica Zubac          0    16.026316
## 300           Brandan Wright          8    15.964286
## 301               Quincy Acy          4    15.937500
## 302            Mike Dunleavy         14    15.833333
## 303            Jonas Jerebko          6    15.794872
## 304        Cristiano Felicio          1    15.757576
## 305        Marreese Speights          8    15.682927
## 306             Luke Babbitt          6    15.661765
## 307             Bobby Portis          1    15.625000
## 308            Pascal Siakam          0    15.618182
## 309           Darrell Arthur          7    15.585366
## 310             Kyle O'Quinn          4    15.556962
## 311                Omer Asik          6    15.548387
## 312               Alec Burks          5    15.547619
## 313             Alex Abrines          0    15.514706
## 314              Aron Baynes          4    15.506667
## 315             Josh Huestis          1    15.500000
## 316           Archie Goodwin          3    15.333333
## 317           Semaj Christon          0    15.203125
## 318            Isaiah Canaan          3    15.179487
## 319            Patrick McCaw          0    15.126761
## 320           Reggie Bullock          3    15.064516
## 321            Alan Williams          1    15.063830
## 322            Alexis Ajinca          6    14.974359
## 323     Mindaugas Kuzminskas          0    14.941176
## 324             Corey Brewer          9    14.916667
## 325            Mario Hezonja          1    14.769231
## 326                Ian Clark          3    14.766234
## 327           K.J. McDaniels          2    14.650000
## 328            Jose Calderon         11    14.529412
## 329              Willie Reed          1    14.521127
## 330              Jason Smith          8    14.432432
## 331          Leandro Barbosa         13    14.373134
## 332               Beno Udrih         12    14.358974
## 333              Lavoy Allen          5    14.278689
## 334            Kyle Anderson          2    14.166667
## 335             Al Jefferson         12    14.106061
## 336       Donatas Motiejunas          4    14.088235
## 337             Aaron Brooks          8    13.753846
## 338         Juan Hernangomez          0    13.580645
## 339              Okaro White          0    13.457143
## 340            Miles Plumlee          4    13.384615
## 341            Dragan Bender          0    13.348837
## 342            Jarell Martin          1    13.285714
## 343               Shawn Long          0    13.000000
## 344            Isaiah Taylor          0    13.000000
## 345            Cameron Payne          1    12.909091
## 346               Tyus Jones          1    12.900000
## 347            Jarrod Uthoff          0    12.777778
## 348         Tomas Satoransky          0    12.614035
## 349               David West         13    12.558824
## 350           Chasson Randle          0    12.500000
## 351              Salah Mejri          1    12.397260
## 352               Trey Burke          3    12.333333
## 353               Quinn Cook          0    12.333333
## 354           Kris Humphries         12    12.303571
## 355             Wade Baldwin          0    12.272727
## 356            Briante Weber          1    12.230769
## 357            Davis Bertans          0    12.059701
## 358        Joffrey Lauvergne          2    12.050000
## 359             Kyle Singler          4    12.031250
## 360            Dahntay Jones         12    12.000000
## 361           Wesley Johnson          6    11.911765
## 362            Cheick Diallo          0    11.705882
## 363          Thomas Robinson          4    11.666667
## 364             Jakob Poeltl          0    11.592593
## 365           Elijah Millsap          2    11.500000
## 366             Gerald Green          9    11.446809
## 367           Kevin Seraphin          6    11.408163
## 368            Rashad Vaughn          1    11.170732
## 369         Andrew Nicholson          4    11.100000
## 370             Brandon Bass         11    11.096154
## 371              Paul Pierce         18    11.080000
## 372           Chinanu Onuaku          0    10.400000
## 373            Maurice Ndour          0    10.343750
## 374             Tyler Zeller          4    10.294118
## 375            Alan Anderson          7    10.266667
## 376            Brian Roberts          4    10.146341
## 377               Thon Maker          0     9.859649
## 378          Darrun Hilliard          1     9.769231
## 379          DeAndre' Bembry          0     9.763158
## 380            Sasha Vujacic          9     9.714286
## 381           Anthony Morrow          8     9.666667
## 382           Shabazz Napier          2     9.660377
## 383         Nicolas Brussino          0     9.648148
## 384              Norris Cole          5     9.615385
## 385      Marcus Georges-Hunt          0     9.600000
## 386             JaVale McGee          8     9.597403
## 387             Ronnie Price         11     9.571429
## 388        Sheldon McClellan          0     9.566667
## 389           Tiago Splitter          6     9.500000
## 390               Kay Felder          0     9.190476
## 391            Spencer Hawes          9     9.000000
## 392       Malachi Richardson          0     9.000000
## 393     James Michael McAdoo          2     8.788462
## 394                Raul Neto          1     8.650000
## 395          Patricio Garino          0     8.600000
## 396             Cole Aldrich          6     8.564516
## 397          Johnny O'Bryant          2     8.500000
## 398             Damian Jones          0     8.500000
## 399          Dejounte Murray          0     8.473684
## 400              Jeff Withey          3     8.470588
## 401             Kevon Looney          1     8.433962
## 402         Boban Marjanovic          1     8.371429
## 403           Christian Wood          1     8.230769
## 404          Pat Connaughton          1     8.102564
## 405         Marshall Plumlee          0     8.095238
## 406            Fred VanVleet          0     7.945946
## 407              James Jones         13     7.937500
## 408              Bryn Forbes          0     7.916667
## 409           Henry Ellenson          0     7.684211
## 410            Udonis Haslem         13     7.647059
## 411              James Young          2     7.586207
## 412         Rakeem Christmas          1     7.551724
## 413              Mike Miller         16     7.550000
## 414            Malik Beasley          0     7.500000
## 415            Adreian Payne          2     7.500000
## 416             A.J. Hammons          0     7.409091
## 417              Jake Layman          0     7.114286
## 418           Treveon Graham          0     7.000000
## 419             Damjan Rudez          2     6.977778
## 420               Ryan Kelly          3     6.875000
## 421              Jordan Hill          7     6.714286
## 422            Deyonta Davis          0     6.611111
## 423             Joel Anthony          9     6.421053
## 424            Nick Collison         12     6.400000
## 425        Metta World Peace         16     6.400000
## 426        Stephen Zimmerman          0     5.684211
## 427            Jordan Mickey          1     5.640000
## 428           Tim Quarterman          0     5.000000
## 429              Bobby Brown          2     4.920000
## 430            Bruno Caboclo          2     4.444444
## 431            Joel Bolomboy          0     4.416667
## 432                Joe Young          1     4.090909
## 433            Georges Niang          0     4.043478
## 434         Chris McCullough          1     4.000000
## 435            Daniel Ochefu          0     3.947368
## 436          Michael Gbinije          0     3.555556
## 437            Diamond Stone          0     3.428571
## 438        Demetrius Jackson          0     3.400000
## 439             Kyle Wiltjer          0     3.142857
## 440            Brice Johnson          0     3.000000
## 441              Roy Hibbert          8     1.833333

Summarizing values with summarise()

The next verb is summarise(). Conceptually, this involves applying a function to one or more columns in order to summarize their values. This is probably easier to understand with an example.

Say you are interested in calculating the average salary of all NBA players. To do this “a la dplyr” you use summarise(), or its synonym function summarize():

# average salary of NBA players
summarise(dat, avg_salary = mean(salary))
##   avg_salary
## 1    6187014

Calculating an average like this seems a bit verbose, especially when you can directly use mean() like this:

mean(dat$salary)
## [1] 6187014

So let’s make things a bit more interesting. What if you want to calculate some summary statistics for salary: min, median, mean, and max?

# some stats for salary (dplyr)
summarise(
  dat, 
  min = min(salary),
  median = median(salary),
  avg = mean(salary),
  max = max(salary)
)
##    min  median     avg      max
## 1 5145 3500000 6187014 30963450

Well, this may still not look like much. You can do the same in base R (although there are better ways to do it):

# some stats for salary (base R)
c(min = min(dat$salary), 
  median = median(dat$salary),
  avg = mean(dat$salary),
  max = max(dat$salary))
##      min   median      avg      max 
##     5145  3500000  6187014 30963450
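
One of the "better ways" hinted at above can be sketched as follows: keep the summary functions in a named list and apply them all in one pass with sapply(). The salary values below are a small hypothetical vector, not the contents of nba2017-players.csv, and the object names salary and stats are chosen only for illustration.

```r
# toy salary vector (hypothetical values, used only for illustration)
salary <- c(5145, 1200000, 3500000, 9000000, 30963450)

# apply each summary function to the same vector;
# the names of the list become the names of the result
stats <- sapply(
  list(min = min, median = median, avg = mean, max = max),
  function(f) f(salary)
)
stats
```

With the real data you would pass dat$salary instead of the toy vector. Adding another statistic only requires adding one more entry to the list.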

Grouped operations

To actually appreciate the power of summarise(), we need to introduce the other major basic verb in "dplyr": group_by(). This is the function that allows you to perform data aggregations, or grouped operations.

Let’s see the combination of summarise() and group_by() to calculate the average salary by team:

# average salary, grouped by team
summarise(
  group_by(dat, team),
  avg_salary = mean(salary)
)
## # A tibble: 30 x 2
##     team avg_salary
##    <chr>      <dbl>
##  1   ATL    6491892
##  2   BOS    6127673
##  3   BRK    4363414
##  4   CHI    6138459
##  5   CHO    6683086
##  6   CLE    8386014
##  7   DAL    6139880
##  8   DEN    5225533
##  9   DET    6871594
## 10   GSW    6579394
## # ... with 20 more rows

Here’s a similar example with the average salary by position:

# average salary, grouped by position
summarise(
  group_by(dat, position),
  avg_salary = mean(salary)
)
## # A tibble: 5 x 2
##   position avg_salary
##      <chr>      <dbl>
## 1        C    6987682
## 2       PF    5890363
## 3       PG    6069029
## 4       SF    6513374
## 5       SG    5535260

Here’s a fancier example: average weight and height, by position, displayed in descending order by average height:

arrange(
  summarise(
    group_by(dat, position),
    avg_height = mean(height),
    avg_weight = mean(weight)),
  desc(avg_height)
)
## # A tibble: 5 x 3
##   position avg_height avg_weight
##      <chr>      <dbl>      <dbl>
## 1        C   83.25843   250.7978
## 2       PF   81.50562   235.8539
## 3       SF   79.63855   220.4699
## 4       SG   77.02105   204.7684
## 5       PG   74.30588   188.5765
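
Nested calls like the one above have to be read from the inside out. As a sketch of an alternative, the same three steps can be chained with the pipe operator %>%, which "dplyr" makes available. The data frame toy below is a hypothetical stand-in for dat with made-up values, and by_position is just an illustrative name.

```r
library(dplyr)

# toy data frame (hypothetical values standing in for dat)
toy <- data.frame(
  position = c("C", "C", "PG", "PG", "SF"),
  height   = c(84, 82, 74, 76, 80),
  weight   = c(250, 240, 190, 180, 220),
  stringsAsFactors = FALSE
)

# group, summarise, then sort: each step feeds the next one
by_position <- toy %>%
  group_by(position) %>%
  summarise(
    avg_height = mean(height),
    avg_weight = mean(weight)
  ) %>%
  arrange(desc(avg_height))

by_position
```

The steps now read top to bottom in the order they are executed, which tends to be easier to follow once more than two verbs are combined.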

Your turn:

  • use summarise() to get the largest height value.
summarise(dat,max_height=max(height))
##   max_height
## 1         87
  • use summarise() to get the standard deviation of points3.
summarise(dat, sd_p3=sd(points3))
##     sd_p3
## 1 55.9721
  • use summarise() and group_by() to display the median of three-points, by team.
summarise(
  group_by(dat,team),
  median_3p= median(points3)
  )
## # A tibble: 30 x 2
##     team median_3p
##    <chr>     <dbl>
##  1   ATL      32.5
##  2   BOS      46.0
##  3   BRK      44.0
##  4   CHI      32.0
##  5   CHO      17.0
##  6   CLE      62.0
##  7   DAL      53.0
##  8   DEN      53.0
##  9   DET      28.0
## 10   GSW      18.0
## # ... with 20 more rows
  • display the average three-pointers by team, in ascending order, of the bottom-5 teams (worst 3-pointer teams)
# average points3 by team, sorted in ascending order, keeping the 5 lowest teams
slice(
  arrange(
    summarise(group_by(dat, team), avg_3p = mean(points3)),
    avg_3p
  ),
  1:5
)
  • obtain the mean and standard deviation of age, for Power Forwards with between 5 and 10 years (inclusive) of experience.
# keep Power Forwards with 5 to 10 years of experience, then summarise age
summarise(
  filter(dat, position == "PF", between(experience, 5, 10)),
  m_age = mean(age),
  sd_age = sd(age)
)

First contact with ggplot()

The package "ggplot2" is probably the most popular package in R to create beautiful static graphics. Compared to the functions in the base package "graphics", the package "ggplot2" follows a somewhat different philosophy, and it tries to be as consistent and modular as possible.

Scatterplots

Let’s start with a scatterplot of salary and points

# scatterplot (option 1)
ggplot(data = dat) +
  geom_point(aes(x = points, y = salary))

  • ggplot() creates an object of class "ggplot"
  • the main input for ggplot() is data which must be a data frame
  • then we use the "+" operator to add a layer
  • the geometric objects (geoms) are points: geom_point()
  • aes() is used to specify the x and y coordinates, by taking columns points and salary from the data frame

The same scatterplot can also be created with this alternative, and more common use of ggplot()

# scatterplot (option 2)
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point()

Label your chunks!

Adding color

Say you want to color code the points in terms of position

# colored scatterplot 
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position))

Maybe you want to modify the size of the dots in terms of points3:

# sized and colored scatterplot 
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position, size = points3))

To add some transparency effect to the dots, you can use the alpha parameter.

# sized and colored scatterplot 
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position, size = points3), alpha = 0.7)

Notice that alpha was specified outside aes(). This is because we are not using any column for the alpha transparency values.
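
To make the difference between setting and mapping concrete, here is a minimal sketch. It assumes a small hypothetical data frame toy (made-up values, not the NBA data). The first plot sets alpha to a constant outside aes(), while the second maps alpha to the column minutes inside aes().

```r
library(ggplot2)

# toy data frame (hypothetical values, used only for illustration)
toy <- data.frame(
  points   = c(100, 400, 900, 1500),
  salary   = c(1e6, 3e6, 8e6, 2e7),
  position = c("C", "PG", "SF", "PG"),
  minutes  = c(300, 900, 1800, 2600)
)

# setting: every dot gets the same transparency
p_set <- ggplot(data = toy, aes(x = points, y = salary)) +
  geom_point(aes(color = position), alpha = 0.7)

# mapping: transparency varies with the values of a column
p_map <- ggplot(data = toy, aes(x = points, y = salary)) +
  geom_point(aes(color = position, alpha = minutes))
```

Printing p_set or p_map draws the plot. Only the mapped version adds a legend for alpha, because only there does the transparency carry information from the data.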

Your turn:

  • Open the ggplot2 cheatsheet
  • Use the data frame gsw to make a scatterplot of height and weight.
ggplot(gsw, aes(height, weight))+
  geom_point()

  • Find out how to make another scatterplot of height and weight, using geom_text() to display the names of the players.
ggplot(gsw) +
  geom_point(aes(height, weight)) +
  geom_text(aes(height, weight, label = player), nudge_x = 1, nudge_y = 1, check_overlap = TRUE)

  • Get a scatterplot of height and weight, for ALL the Warriors, displaying their names with geom_label().
# geom_label() has no check_overlap argument, so it is omitted here
ggplot(filter(dat, team == "GSW"), aes(height, weight)) +
  geom_point() +
  geom_label(aes(label = player))

  • Get a density plot of salary (for all NBA players).
ggplot(dat,aes(salary))+
  geom_density()

  • Get a histogram of points2 with binwidth of 50 (for all NBA players).
ggplot(dat,aes(points2))+
  geom_histogram(binwidth=50)

  • Get a barchart of the position frequencies (for all NBA players).
ggplot(dat,aes(position))+
  geom_bar()

  • Make a scatterplot of experience and salary of all Centers, and use geom_smooth() to add a regression line.
# method = "lm" fits a straight regression line on top of the points
Center <- filter(dat, position == "C")
ggplot(Center, aes(experience, salary)) +
  geom_point() +
  geom_smooth(method = "lm")

  • Repeat the same scatterplot of experience and salary of all Centers, but now use geom_smooth() to add a loess line (i.e. smooth line).
# method = "loess" fits a smooth local-regression curve on top of the points
Center <- filter(dat, position == "C")
ggplot(Center, aes(experience, salary)) +
  geom_point() +
  geom_smooth(method = "loess")

Faceting

One of the most attractive features of "ggplot2" is the ability to display multiple facets. The idea of facets is to divide a plot into subplots based on the values of one or more categorical (or discrete) variables.

Here’s an example. What if you want to get scatterplots of points and salary separated (or grouped) by position? This is where faceting comes in handy, and you can use facet_wrap() for this purpose:

# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point() +
  facet_wrap(~ position)

The other faceting function is facet_grid(), which allows you to control the layout of the facets (by rows, by columns, etc)

# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position), alpha = 0.7) +
  facet_grid(~ position) +
  geom_smooth(method = loess)

# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
  geom_point(aes(color = position), alpha = 0.7) +
  facet_grid(position ~ .) +
  geom_smooth(method = loess)
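
The formula passed to facet_grid() has the form rows ~ columns, which is why the two calls above produce one row of facets and one column of facets respectively. As a sketch, both sides can be used at once to get a full grid. The data frame toy below is hypothetical (made-up values and only two teams), used purely to keep the example self-contained.

```r
library(ggplot2)

# toy data frame (hypothetical values, used only for illustration)
toy <- data.frame(
  points   = c(100, 400, 900, 1500, 700, 250),
  salary   = c(1e6, 3e6, 8e6, 2e7, 5e6, 2e6),
  position = c("C", "PG", "C", "PG", "C", "PG"),
  team     = c("GSW", "GSW", "CLE", "CLE", "GSW", "CLE")
)

# one row of facets per position, one column of facets per team
p_grid <- ggplot(data = toy, aes(x = points, y = salary)) +
  geom_point() +
  facet_grid(position ~ team)
```

With the full data set, 5 positions by 30 teams would give 150 small panels, so a two-sided grid is mostly useful when both variables have few levels.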

Your turn:

  • Make scatterplots of experience and salary faceting by position
ggplot(data = dat, aes(x = experience, y = salary)) +
  geom_point() +
  facet_wrap(~ position)

  • Make scatterplots of experience and salary faceting by team
ggplot(data = dat, aes(x = experience, y = salary)) +
  geom_point() +
  facet_wrap(~ team)

  • Make density plots of age faceting by team
ggplot(data = dat, aes(age)) +
  geom_density() +
  facet_wrap(~ team)

  • Make scatterplots of height and weight faceting by position
ggplot(data = dat, aes(x = height, y = weight)) +
  geom_point() +
  facet_wrap(~ position)

  • Make scatterplots of height and weight, with a 2-dimensional density, geom_density2d(), faceting by position
# points plus 2-dimensional density contours, one facet per position
ggplot(data = dat, aes(x = height, y = weight)) +
  geom_point() +
  geom_density2d() +
  facet_wrap(~ position)

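  • Make a scatterplot of experience and salary for the Warriors, but this time add a layer with theme_bw() to get a simpler background
# keep only the Warriors, draw the points, then apply the theme
ggplot(data = filter(dat, team == "GSW"), aes(x = experience, y = salary)) +
  geom_point() +
  theme_bw()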

  • Repeat any of the previous plots but now adding a layer with another theme, e.g. theme_minimal(), theme_dark(), theme_classic()
# a theme only changes the appearance, so a geom is still needed to draw the data
ggplot(data = dat, aes(x = experience, y = salary)) +
  geom_point() +
  theme_minimal()

More shell commands

Now that you have a bunch of images inside the images/ subdirectory, let’s keep practicing some basic commands.

  • Open the terminal.
  • Move inside the images/ directory of the lab.
  • List the contents of this directory.
  • Now list the contents of the directory in long format.
  • How would you list the contents in long format, by time?
  • How would you list the contents displaying the results in reverse (alphabetical) order?
  • Without changing your current directory, create a directory copies at the parent level (i.e. lab05/).
  • Copy one of the PNG files to the copies folder.
  • Use the wildcard * to copy all the .png files into the directory copies.
  • Change to the directory copies.
  • Use the command mv to rename some of your PNG files.
  • Change to the report/ directory.
  • From within report/, find out how to rename the directory copies as copy-files.
  • From within report/, delete one or two PNG files in copy-files.
  • From within report/, find out how to delete the directory copy-files.